
Extending gpfdist in Cloudberry Database to Support SFTP Protocol for… #1226

Open
ZTE-EBASE wants to merge 28 commits into apache:main from ZTE-EBASE:temp_cloudberry
ZTE-EBASE commented Jul 12, 2025

… Data Ingestion

gpfdist is a file distribution program in Cloudberry that loads external data into the database in parallel. Its drawback is that the data files must reside on the same machine as gpfdist. Extending it to support the SFTP protocol removes this restriction and enables loading files from a remote server.

Fixes #ISSUE_Number

What does this PR do?

This PR extends gpfdist to support the SFTP protocol, so data can be loaded from a remote server and gpfdist no longer has to run on the machine that holds the data files.

Type of Change

New feature (non-breaking change)

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:
The ssh2 library is required at compile time and must be installed under /usr/local.

Checklist

Additional Context

Under this approach, the location template for the external table is:

CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('gpfdist://ip:port/<sftp://sftp-user:passwd@sftp-hostip:sftp-port/file.csv>') format 'csv' (DELIMITER '|');
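
To make the nesting in the template easier to see, here is a minimal Python sketch that assembles the location string and the statement from their parts. The helper names `build_location` and `build_ddl` and all sample values are hypothetical and are not part of this PR. The sketch performs no escaping, since the PR does not state how reserved characters in the password or path should be handled.

```python
def build_location(gpfdist_host, gpfdist_port, sftp_user, sftp_password,
                   sftp_host, sftp_port, remote_file):
    """Compose a LOCATION value that follows the template above.

    The SFTP URL is wrapped in angle brackets and used as the path
    of the gpfdist URL. No escaping is applied (assumption: inputs
    contain no reserved characters).
    """
    sftp_url = (f"sftp://{sftp_user}:{sftp_password}"
                f"@{sftp_host}:{sftp_port}/{remote_file.lstrip('/')}")
    return f"gpfdist://{gpfdist_host}:{gpfdist_port}/<{sftp_url}>"


def build_ddl(table, columns, location, delimiter="|"):
    """Render a CREATE EXTERNAL TABLE statement shaped like the template."""
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
    return (f"CREATE EXTERNAL TABLE {table} ({cols}) "
            f"LOCATION ('{location}') FORMAT 'csv' (DELIMITER '{delimiter}');")


# Hypothetical sample values, chosen only for illustration.
location = build_location("10.0.0.1", 9876, "loader", "secret",
                          "10.0.0.2", 22, "file.csv")
ddl = build_ddl("ext1", [("d", "varchar(20)")], location)
print(ddl)
```

Keeping the SFTP credentials in one helper also makes it easier to avoid pasting them into several statements by hand.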

Related Test Case:
1 Start gpfdist

[cdbberry@node196 ~]$ gpfdist -d /home/cdbberry/ -p 9876 -l gpfdist.log &
[1] 83161
[cdbberry@node196 ~]$ 2025-07-12 14:49:21 83161 INFO Before opening listening sockets - following listening sockets are available:
2025-07-12 14:49:21 83161 INFO IPV6 socket: [::]:9876
2025-07-12 14:49:21 83161 INFO IPV4 socket: 0.0.0.0:9876
2025-07-12 14:49:21 83161 INFO Trying to open listening socket:
2025-07-12 14:49:21 83161 INFO IPV6 socket: [::]:9876
2025-07-12 14:49:21 83161 INFO Opening listening socket succeeded
2025-07-12 14:49:21 83161 INFO Trying to open listening socket:
2025-07-12 14:49:21 83161 INFO IPV4 socket: 0.0.0.0:9876
2025-07-12 14:49:21 83161 INFO Opening listening socket succeeded
Serving HTTP on port 9876, directory /home/cdbberry

2 Create tables (regular and external)

CREATE table test(
id int,
name varchar(20)
);

CREATE external table test_ext(
id int,
name varchar(20)
)
location 
('gpfdist://10.229.89.196:9876/<sftp://xxx:xxxx@xxx:22/xx.csv>')
format 'csv' (delimiter as '|' NULL as '' FILL MISSING FIELDS) SEGMENT REJECT LIMIT 2 ROWS;
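
The location above carries two endpoints in one string: the gpfdist listener and the remote SFTP file. As an illustration only, the following Python sketch splits such a string into its parts. It is not the parsing code in this PR (gpfdist itself is not written in Python), and the function name `split_location` is hypothetical.

```python
from urllib.parse import urlsplit


def split_location(location):
    """Split 'gpfdist://host:port/<sftp://user:pw@host:port/path>' into parts."""
    prefix = "gpfdist://"
    if not location.startswith(prefix):
        raise ValueError("not a gpfdist location")

    # Everything up to the first '/' is the gpfdist host:port pair.
    authority, _, rest = location[len(prefix):].partition("/")
    host, _, port = authority.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected gpfdist host:port")

    # The remainder must be the SFTP URL wrapped in angle brackets.
    if not (rest.startswith("<") and rest.endswith(">")):
        raise ValueError("expected an embedded <sftp://...> URL")
    inner = urlsplit(rest[1:-1])
    if inner.scheme != "sftp":
        raise ValueError("embedded URL is not sftp")

    return {
        "gpfdist_host": host,
        "gpfdist_port": int(port),
        "sftp_user": inner.username,
        "sftp_password": inner.password,
        "sftp_host": inner.hostname,
        "sftp_port": inner.port,
        "remote_path": inner.path,
    }


# The placeholder location from the test case above.
parts = split_location(
    "gpfdist://10.229.89.196:9876/<sftp://xxx:xxxx@xxx:22/xx.csv>")
print(parts)
```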

3 Load data

insert into test select * from test_ext;

4 Result

postgres=# insert into test select * from test_ext;
INSERT 0 10
postgres=# select * from test;
 id |   name    
----+-----------
  2 | ZTE-EBASE
  3 | ZTE-EBASE
  4 | ZTE-EBASE
  6 | ZTE-EBASE
  7 | ZTE-EBASE
  8 | ZTE-EBASE
  9 | ZTE-EBASE
 10 | ZTE-EBASE
  1 | ZTE-EBASE
  5 | ZTE-EBASE
(10 rows)

cat test.csv
1|ZTE-EBASE
2|ZTE-EBASE
3|ZTE-EBASE
4|ZTE-EBASE
5|ZTE-EBASE
6|ZTE-EBASE
7|ZTE-EBASE
8|ZTE-EBASE
9|ZTE-EBASE
10|ZTE-EBASE

The row count and contents of the table match the file.
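
This comparison can also be scripted rather than done by eye. The sketch below is one possible way to do it, not part of this PR. The function name `rows_match` is hypothetical, and the rows are typed in from the output shown above instead of being fetched from the database. Because the query returns rows in a different order than the file, the comparison sorts both sides first.

```python
import csv
import io


def rows_match(csv_text, table_rows, delimiter="|"):
    """Return True if the file and the table hold the same rows.

    Both sides are sorted before comparing, so row order does not matter.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    file_rows = sorted((int(row[0]), row[1]) for row in reader if row)
    return file_rows == sorted(table_rows)


# Contents of test.csv as shown above.
csv_text = "\n".join(f"{i}|ZTE-EBASE" for i in range(1, 11))

# Rows in the order returned by "select * from test" above.
table_rows = [(i, "ZTE-EBASE") for i in (2, 3, 4, 6, 7, 8, 9, 10, 1, 5)]

print(rows_match(csv_text, table_rows))
```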

CI Skip Instructions

